Final Project - Analyzing Sales Data

Date: 27 January 2023

Author: Tongtai Mahasuwan

Course: Pandas Foundation

# import data
import pandas as pd
df = pd.read_csv("sample-store.csv")
# preview top 5 rows
df.head(5)
# shape of dataframe
df.shape
(9994, 21)
# see data frame information using .info()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9994 non-null int64
1 Order ID 9994 non-null object
2 Order Date 9994 non-null object
3 Ship Date 9994 non-null object
4 Ship Mode 9994 non-null object
5 Customer ID 9994 non-null object
6 Customer Name 9994 non-null object
7 Segment 9994 non-null object
8 Country/Region 9994 non-null object
9 City 9994 non-null object
10 State 9994 non-null object
11 Postal Code 9983 non-null float64
12 Region 9994 non-null object
13 Product ID 9994 non-null object
14 Category 9994 non-null object
15 Sub-Category 9994 non-null object
16 Product Name 9994 non-null object
17 Sales 9994 non-null float64
18 Quantity 9994 non-null int64
19 Discount 9994 non-null float64
20 Profit 9994 non-null float64
dtypes: float64(4), int64(2), object(15)
memory usage: 1.6+ MB

We can use the pd.to_datetime() function to convert the 'Order Date' and 'Ship Date' columns to datetime.

# example of pd.to_datetime() function
pd.to_datetime(df['Order Date'].head(), format='%m/%d/%Y')
TODO - convert order date and ship date to datetime in the original dataframe
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%m/%d/%Y')
df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='%m/%d/%Y')
df[['Order Date','Ship Date']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order Date 9994 non-null datetime64[ns]
1 Ship Date 9994 non-null datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 156.3 KB
TODO - count nan in postal code column
# count missing values (NaN) per column
df.isnull().sum()
# count empty strings per column
(df == '').sum()
# count NaN in the Postal Code column directly
df['Postal Code'].isna().sum()
TODO - filter rows with missing values
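One way to complete this TODO, sketched on a small made-up frame (the 'Postal Code' column name matches the dataset above; the city values are hypothetical):

```python
import pandas as pd
import numpy as np

# small sample frame standing in for df (values are made up)
sample = pd.DataFrame({
    "City": ["Houston", "Seattle", "Miami"],
    "Postal Code": [77001.0, np.nan, 33101.0],
})

# rows with at least one missing value in any column
rows_with_nan = sample[sample.isnull().any(axis=1)]

# rows where one specific column is missing
missing_postal = sample[sample["Postal Code"].isna()]
```

On the real dataset, both filters should return the 11 rows whose Postal Code is NaN, since that is the only column with missing values.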
TODO - Explore this dataset on your own, and ask your own questions

Data Analysis Part

Answer the 10 questions below to get credit for this course. Write pandas code to find the answers.

TODO 01 - how many columns, rows in this dataset
df.shape
(9994, 21)
TODO 02 - are there any missing values? If so, in which column, and how many NaN values?
df.isnull().sum()
TODO 03 - your friend asks for the `California` data; filter it and export a csv for him
store_California = df[df['State'] =='California']
store_California.to_csv('store_California.csv')
TODO 04 - your friend asks for all order data in `California` and `Texas` in 2017 (look at Order Date); send him a csv file
# the outer parentheses around the state check are required: & binds tighter than |
store_California_Texas_2017 = df[((df['State'] == 'Texas') | (df['State'] == 'California')) & (df['Order Date'].dt.year == 2017)]
store_California_Texas_2017.to_csv('store_California_Texas_2017.csv')
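The intended filter (both states, year 2017) can also be written with .isin(), which avoids the operator-precedence pitfall of chained | and & conditions; a minimal sketch on made-up rows:

```python
import pandas as pd

# sample rows standing in for df (values are made up)
sample = pd.DataFrame({
    "State": ["California", "Texas", "Ohio", "Texas"],
    "Order Date": pd.to_datetime(
        ["2017-03-01", "2016-05-02", "2017-07-04", "2017-08-09"]),
})

# .isin() replaces the (A | B) state check in a single expression
mask = sample["State"].isin(["California", "Texas"]) & (sample["Order Date"].dt.year == 2017)
filtered = sample[mask]
```

Here only the California 2017 and Texas 2017 rows survive; the 2016 Texas order is dropped.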
TODO 05 - what are the total sales, average sales, and standard deviation of sales your company made in 2017?
df['Sales'][df['Order Date'].dt.year == 2017].agg(['sum','mean','std'])
TODO 06 - which Segment has the highest profit in 2018
df[df['Order Date'].dt.year == 2018].groupby('Segment')['Profit'].agg("sum").sort_values(ascending =False).head(1)
TODO 07 - which top 5 States have the least total sales between 15 April 2019 - 31 December 2019
# ascending sort so .head(5) returns the lowest totals; >= and <= include both boundary dates
df[(df['Order Date'] >= '2019-04-15') & (df['Order Date'] <= '2019-12-31')].groupby('State')['Sales'].agg("sum")\
    .sort_values(ascending=True).head(5)
TODO 08 - what is the proportion of total sales (%) in West + Central in 2019 e.g. 25% 
Sale_wc_2019 = df[((df['Region'] == 'West') | (df['Region'] == 'Central')) & (df['Order Date'].dt.year == 2019)]['Sales'].agg('sum')
Sale_2019 = df[(df['Order Date'].dt.year == 2019)]['Sales'].agg('sum')
print(((Sale_wc_2019/Sale_2019) *100),"%")
54.97479891837763 %
TODO 09 - find top 10 popular products in terms of number of orders vs. total sales during 2019-2020
df[(df['Order Date'].dt.year >= 2019) & (df['Order Date'].dt.year <= 2020)].groupby('Product Name')['Quantity']\
.agg('sum').sort_values(ascending = False).head(10)
df[(df['Order Date'].dt.year >= 2019) & (df['Order Date'].dt.year <= 2020)].groupby('Product Name')['Sales']\
    .agg('sum').sort_values(ascending = False).head(10)
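The two rankings above can also be produced side by side with a single groupby().agg() over both columns, which makes the quantity-vs-sales comparison easier to read; a sketch on made-up product rows:

```python
import pandas as pd

# sample rows standing in for the 2019-2020 slice of df (values are made up)
sample = pd.DataFrame({
    "Product Name": ["Stapler", "Chair", "Stapler", "Chair", "Lamp"],
    "Quantity": [3, 1, 4, 2, 6],
    "Sales": [30.0, 200.0, 40.0, 800.0, 75.0],
})

# total units and total sales per product, ranked by units sold
summary = (sample.groupby("Product Name")[["Quantity", "Sales"]]
           .sum()
           .sort_values("Quantity", ascending=False))
```

Sorting the same table by "Sales" instead gives the second ranking without re-running the groupby.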
TODO 10 - plot at least 2 plots, any plot you think interesting :)
df[['Sales','Profit']].plot(x='Sales', y='Profit', kind='scatter', color='lightblue')
<Axes: xlabel='Sales', ylabel='Profit'>
df['Segment'].value_counts().plot(kind='bar')
<Axes: >
TODO Bonus - use np.where() to create new column in dataframe to help you answer your own questions
# My question: group observations according to profit.
df['Profit'].plot(kind='box')
<Axes: >
import numpy as np
df['New_Profit'] = np.where(df['Profit'] < 0,"Very Bad","Bad")
df['New_Profit'] = np.where(df['Profit'] > 25,"Ok",df['New_Profit'])
df['New_Profit'] = np.where(df['Profit'] > 100,"Good",df['New_Profit'])
df['New_Profit'].value_counts()
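An alternative to the chained np.where() calls is pd.cut(), which expresses the same profit bands as one declarative binning step; a sketch with the same thresholds as above (note that pd.cut() with right-closed bins puts a profit of exactly 0 in "Very Bad", whereas the np.where() chain labels it "Bad"):

```python
import pandas as pd

# sample profit values standing in for df['Profit'] (values are made up)
profits = pd.Series([-10.0, 5.0, 50.0, 300.0])

# same bands as the np.where() chain: <0 Very Bad, up to 25 Bad,
# up to 100 Ok, above 100 Good
labels = pd.cut(profits,
                bins=[float("-inf"), 0, 25, 100, float("inf")],
                labels=["Very Bad", "Bad", "Ok", "Good"])
```

pd.cut() keeps the thresholds and labels in one place, so adding or moving a band means editing one list instead of reordering a chain of overwrites.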